Deep Learning Methods for Predicting Disease Status Using Genomic Data

Predicting disease status for a complex human disease using genomic data is an important, yet challenging, step in personalized medicine. Among many challenges, the so-called curse of dimensionality results in unsatisfactory performance of many state-of-the-art machine learning algorithms. A major recent advance in machine learning is the rapid development of deep learning algorithms that can efficiently extract meaningful features from high-dimensional and complex datasets through a stacked and hierarchical learning process. Deep learning has shown breakthrough performance in several areas, including image recognition, natural language processing, and speech recognition. However, the performance of deep learning in predicting disease status using genomic datasets is still not well studied. In this article, we review the four relevant articles that we found through a thorough literature search. All four articles first used auto-encoders to project high-dimensional genomic data to a low-dimensional space and then applied state-of-the-art machine learning algorithms to predict disease status based on the low-dimensional representations. These deep learning approaches outperformed existing prediction methods, such as prediction based on transcript-wise screening and prediction based on principal component analysis. The limitations of the current deep learning approach and possible improvements are also discussed.


Introduction
Complex human diseases, such as cancers, cardiovascular diseases, and respiratory diseases, have caused huge public health concerns and economic burdens [1,2]. It is believed that both environmental factors, such as smoking exposure, nutrient intake, and physical exercise, and genomic factors contribute to the development of complex human diseases [3]. By genomic factors we mean any molecular factors related to genes, such as genotype, mRNA expression, DNA methylation, microRNA expression, metabolites, proteins, etc. Cutting-edge technologies, e.g., genotyping and next-generation whole genome sequencing, greatly facilitate investigations of the associations between genomic factors and complex human diseases, so that researchers can detect disease-associated factors in an unbiased manner. In addition to uncovering the underlying molecular mechanisms, researchers expect that the disease-associated genomic factors could also help diagnose disease, personalize treatment, and develop new medicines [4].
Several machine learning methods, such as support vector machine [5] (SVM), random forest [6], and k-nearest neighbors [7], have been successfully applied in disease prediction based on clinical data [8][9][10]. For genomic data generated by high-throughput technologies (Figure 1), the major challenge in disease prediction is the "curse of dimensionality" [11][12][13], which refers to the scenario where the number of genomic factors is far larger than the number of samples, resulting in model over-fitting and computational inefficiency. Model over-fitting is the phenomenon in which a model fits a particular set of data so closely that it fails to fit additional data or to predict future observations reliably.
A reasonable approach [14,15] to handle the curse of dimensionality is to first apply feature selection techniques to select key features relevant to the disease of interest, and then to predict the disease status based on these key features (Figure 2). In genomic data analysis, a feature can be a gene transcript or a (non-)linear combination of several gene transcripts. Traditional feature selection techniques include forward variable selection, backward variable deletion, stepwise variable selection, transcript-wise tests, and principal component analysis. These methods have limited performance in genomic data analysis. Forward variable selection, backward variable deletion, and stepwise variable selection are time-consuming. Hence, they are not suitable for genome-wide analysis. Transcript-wise tests ignore the fact that many omics variables are correlated and therefore carry redundant information regarding prediction. Ignoring the redundancy can make the selected transcripts non-reproducible in independent cohorts [13,16,17]. In addition, the contributions of different genomic risk factors might vary; however, transcript-wise tests implicitly assign equal weights to all selected transcripts. Principal component analysis (PCA) explicitly assigns different weights to different transcripts. However, PCA produces a linear combination of transcripts and ignores possible non-linear relationships between transcripts.
Recently, deep learning methods have made breakthrough progress in image/video recognition [18], natural language processing [19], and robotics [20,21]. Through a stacked and hierarchical learning system, deep learning methods can efficiently capture complex relationships, either spatial or sequential, between high-dimensional features [22].
In bioinformatics, deep learning methods have had fruitful and innovative applications in medical image classification [23,24], prediction of the sequence specificities of DNA- and RNA-binding proteins [25], and prediction of the effects of noncoding DNA sequence variants [26]. However, using deep learning methods to predict disease status is not a well-researched area.
Many investigators in genomic data analysis may have heard of deep learning and would like to learn more about it and how it could be used to predict disease status based on genomic data. In this review, we will first introduce the main components of deep learning and the most frequently used deep learning feature extraction methods in genomic data analysis. We will then review the papers that used deep learning to predict complex human diseases based on genomic data. The limitations of the current deep learning approach and possible improvements will also be discussed.
Before we review the details of the four studies, we first introduce in the following sections the main components of deep learning and the most frequently used deep learning feature extraction methods in genomic data analysis.

Artificial Neural Networks (ANNs) and Deep Learning Methods in Predicting Disease
The main component of all deep learning algorithms is the artificial neural network (ANN). Understanding how ANNs are constructed and trained is the first step toward understanding deep learning methods.

Artificial neural networks (ANNs)
Artificial neural networks are computing systems inspired by the biological neural networks that constitute brains. Typically, an ANN is a multilayer network of nodes: one input layer, one output layer, and several hidden internal layers. Within a layer, nodes are not connected, while nodes in adjacent layers are fully connected (Figures 3 and 4). Each node can store a value. For instance, in Figure 3, $Z_i$ is the value stored in the i-th node. Each edge can have a weight. For example, the weight $w_{ji}$ indicates the amount of information passing to node i in the given layer from node j in the previous layer. The value of a node on a given layer, except for the input layer, is a function of a bias (i.e., threshold; e.g., $b_i$ for the i-th node) and the weighted sum of the values of all nodes on the previous layer. This function is called an activation function. For instance, $Y_i = 1$ if $b_i + w_{1i} Z_1 + \dots + w_{ni} Z_n > 0$ and $Y_i = 0$ otherwise, where n is the number of nodes in the previous layer and $Z_j$ is the value of the j-th node in the previous layer. Usually, activation functions, such as sigmoid, rectified linear unit (ReLU) [30], and hyperbolic tangent (Tanh), are non-linear.
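The node-value computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from any of the reviewed papers; the function names (`layer_forward`, `step`) and all numeric values are hypothetical, and the weights are stored so that `weights[j][i]` plays the role of the weight from node j in the previous layer to node i in the current layer.

```python
import math

def step(x):
    """Threshold activation: 1 if the biased weighted sum is positive, else 0."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def layer_forward(z_prev, weights, biases, activation):
    """Compute the node values of one fully connected layer.

    z_prev  : values of the n nodes in the previous layer
    weights : weights[j][i] is the weight from node j (previous layer)
              to node i (current layer)
    biases  : biases[i] is the bias of node i in the current layer
    """
    values = []
    for i in range(len(biases)):
        weighted_sum = sum(weights[j][i] * z_prev[j] for j in range(len(z_prev)))
        values.append(activation(biases[i] + weighted_sum))
    return values

# Hypothetical toy layer: 3 nodes in the previous layer, 2 nodes in this one.
z_prev = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0], [0.25, 0.0], [-0.5, 0.5]]
biases = [0.1, 1.0]

step_values = layer_forward(z_prev, weights, biases, step)
relu_values = layer_forward(z_prev, weights, biases, relu)
sigmoid_values = layer_forward(z_prev, weights, biases, sigmoid)
```

Swapping the `activation` argument is the only change needed to move from the threshold rule in the text to sigmoid or ReLU.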

Training ANNs
To estimate the optimal values of the biases and edge weights, a training data set and a validation set are needed, in which the values of the nodes in the output layer are known. The idea is to find a set of biases and edge weights that minimizes the difference between the true values and the predicted values of the nodes in the output layer. This difference is a function of the biases and edge weights and is usually called the loss function.
Gradient descent is an optimization method for updating the parameters of a neural network to minimize the loss function (Figure 5). It uses the fact that optimal parameters are achieved when the gradient of the loss function with respect to the parameters is zero. However, finding parameters that solve the zero-gradient equation is a nontrivial task for complex networks with a large number of parameters. An alternative to solving the gradient equation directly is to start from an initial point, iteratively update each parameter by an amount proportional to the negative of the gradient of the loss function with respect to that parameter, and continue this procedure until the change in the parameters falls below a predefined threshold. An important part of this method is calculating the gradient of the loss function with respect to every parameter in the network. Backpropagation is an algorithm for efficiently calculating the gradient of the loss function, Loss(w), with respect to each parameter by applying the chain rule layer by layer. This implies that once we know the gradients at some layer, we can easily calculate the gradients for the layer before it.
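As a rough illustration of the two ideas above, the following sketch trains a deliberately tiny network (one input, one sigmoid hidden node, one linear output node) on a single hypothetical training example. The network shape, the squared-error loss, the learning rate, and the stopping threshold are all assumptions made for the example. The gradients for the hidden-layer parameters are obtained from the output-layer gradient through the chain rule, which is the essence of backpropagation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss_and_grads(params, x, y):
    """Squared-error loss and its gradient for a 1-1-1 network.

    params = [w1, b1, w2, b2]; hidden = sigmoid(w1*x + b1); output = w2*hidden + b2.
    """
    w1, b1, w2, b2 = params
    hidden = sigmoid(w1 * x + b1)
    y_hat = w2 * hidden + b2
    loss = 0.5 * (y_hat - y) ** 2
    # Gradient at the output layer.
    d_out = y_hat - y
    # Chain rule: reuse d_out to get the gradient at the hidden layer.
    d_hidden = d_out * w2 * hidden * (1.0 - hidden)
    grads = [d_hidden * x, d_hidden, d_out * hidden, d_out]
    return loss, grads

def gradient_descent(params, x, y, lr=0.1, tol=1e-8, max_iter=10000):
    """Move each parameter against its gradient until the updates are tiny."""
    params = list(params)
    for _ in range(max_iter):
        _, grads = loss_and_grads(params, x, y)
        updates = [lr * g for g in grads]
        params = [p - u for p, u in zip(params, updates)]
        if max(abs(u) for u in updates) < tol:
            break
    return params

# Hypothetical starting point and training example.
start = [0.5, 0.0, -0.3, 0.1]
initial_loss, start_grads = loss_and_grads(start, x=1.5, y=1.0)
fitted = gradient_descent(start, x=1.5, y=1.0)
final_loss, _ = loss_and_grads(fitted, x=1.5, y=1.0)
```

Real networks apply the same recursion to vectors and matrices of parameters, usually over mini-batches of samples rather than a single example.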

Deep learning and deep neural networks (DNNs)
ANNs with only one or two hidden layers have a shallow architecture, which contains only two levels of data-dependent computational elements and can be very inefficient in terms of the number of hidden nodes and the number of required training examples [11]. In contrast, deep neural networks are ANNs with more than two hidden layers. This deep architecture can compactly represent a large number of computational elements via the composition of many nonlinearities [11]. Deep learning methods are defined as computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [22].
The performance of deep learning relies on the methods used to train the parameters in DNNs. Intuitively, we can train the parameters by minimizing the prediction error rate (the loss function) using gradient descent. However, empirical experiments showed that this purely supervised approach performs poorly for DNNs [11,31] in the regime where the number of input features is comparable to (or even far larger than) the number of training samples, which is the case in genomic datasets. In contrast, the unsupervised learning at each stage of a deep network proposed in the seminal works of Hinton et al. [32] and Hinton and Salakhutdinov [33] pre-trains each hidden layer as the encoder of an auto-encoder that tries to reconstruct the output of the previous layer. Hence, combining this unsupervised pre-training with a supervised step, such as fine-tuning all the parameters of the ANN using backpropagation and gradient descent on a global supervised cost function, can significantly improve the performance of deep learning methods on datasets with few samples [11,31].

Auto-encoder (AE)
An auto-encoder is a type of ANN that aims to find a new representation of the input nodes (e.g., gene transcripts in genomic data analysis) in an unsupervised manner, from which the input can be reconstructed without too much loss of information [31]. Like other ANNs, an auto-encoder has one input layer, one output layer, and one or multiple hidden layers (Figure 6). Suppose X is the original data in a p-dimensional space. An auto-encoder first projects X to a q-dimensional space, $Y = g_1(X)$, where $g_1$ is a non-linear projection function. It then transforms Y back to the p-dimensional space, $Z = g_2(Y)$, where $g_2$ is also a non-linear projection function. The optimal projection $Y^*$ minimizes the loss function $\mathrm{loss}[X, g_2(Y)]$, which measures the difference between X and $Z = g_2(Y)$. Note that since q is different from p, neither the projection function $g_1$ nor the projection function $g_2$ is a one-to-one mapping function. Hence, the inverse functions $g_1^{-1}$ and $g_2^{-1}$ do not exist.
Similar to training ANNs, backpropagation and gradient descent can be applied to train an auto-encoder, in which the output layer $Z = g_2(Y) = g_2(g_1(X))$ has the same dimension as the original data.
The nodes $Y = g_1(X)$ within the hidden layer are the representations of the original features. The hidden layer is "under-complete" if the number (q) of nodes in the hidden layer is smaller than that (p) in the input layer (q<p). In most cases, an auto-encoder outperforms principal component analysis in processing high-dimensional complex datasets because an auto-encoder can perform both linear and non-linear projections, while PCA performs only linear projections. Auto-encoders have been successfully used to efficiently extract meaningful features in disease diagnosis based on high-throughput genomic data [27,34].
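To make the encode/decode structure concrete, here is a minimal sketch of an under-complete auto-encoder in plain Python. It is not the implementation used in any of the reviewed papers: the tanh encoder, the linear decoder, the squared-error reconstruction loss, the learning rate, and the simulated data (p = 4 features driven by 2 hidden factors) are all assumptions chosen to keep the example short.

```python
import math
import random

def encode(x, W1, b1):
    """Y = g1(X): non-linear projection from p to q dimensions."""
    return [math.tanh(b1[k] + sum(W1[k][i] * x[i] for i in range(len(x))))
            for k in range(len(b1))]

def decode(y, W2, b2):
    """Z = g2(Y): map back from q to p dimensions (linear here for brevity)."""
    return [b2[i] + sum(W2[i][k] * y[k] for k in range(len(y)))
            for i in range(len(b2))]

def reconstruction_loss(data, W1, b1, W2, b2):
    """Mean squared reconstruction error between X and Z = g2(g1(X))."""
    total = 0.0
    for x in data:
        z = decode(encode(x, W1, b1), W2, b2)
        total += sum((zi - xi) ** 2 for zi, xi in zip(z, x))
    return total / len(data)

def train_autoencoder(data, q, lr=0.02, epochs=200, seed=0):
    """Fit the auto-encoder by per-sample gradient descent with backpropagation."""
    rng = random.Random(seed)
    p = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(q)]
    b1 = [0.0] * q
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(q)] for _ in range(p)]
    b2 = [0.0] * p
    for _ in range(epochs):
        for x in data:
            y = encode(x, W1, b1)
            z = decode(y, W2, b2)
            # Gradients of sum_i (z_i - x_i)^2, computed before any update.
            d_z = [2.0 * (z[i] - x[i]) for i in range(p)]
            d_y = [sum(d_z[i] * W2[i][k] for i in range(p)) * (1.0 - y[k] ** 2)
                   for k in range(q)]
            for i in range(p):
                for k in range(q):
                    W2[i][k] -= lr * d_z[i] * y[k]
                b2[i] -= lr * d_z[i]
            for k in range(q):
                for i in range(p):
                    W1[k][i] -= lr * d_y[k] * x[i]
                b1[k] -= lr * d_y[k]
    return W1, b1, W2, b2

# Simulated data: 30 samples, 4 features generated from 2 hidden factors.
data_rng = random.Random(1)
data = []
for _ in range(30):
    a, b = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    data.append([a, b, 0.5 * (a + b), 0.5 * (a - b)])

untrained = train_autoencoder(data, q=2, epochs=0)
trained = train_autoencoder(data, q=2, epochs=200)
loss_before = reconstruction_loss(data, *untrained)
loss_after = reconstruction_loss(data, *trained)
```

After training, `encode(x, trained[0], trained[1])` gives the q-dimensional representation that would be passed on to a classifier.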

Sparse auto-encoder (SpAE)
Performing backpropagation and gradient descent could be inefficient if there are too many free nodes with complex dependencies in each layer [35,36]. The sparse auto-encoder was developed to restrict the number of hidden nodes that are activated by introducing sparsity constraints on the hidden units (Figure 7). Sparse auto-encoders have been shown to perform favorably in image recognition [37] and speech emotion recognition [38], owing to their efficiency in extracting meaningful features from high-dimensional data.
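The text does not specify which sparsity constraint is meant. One widely used choice, assumed here purely for illustration, is a Kullback-Leibler penalty that pushes the average activation of each hidden node toward a small target value `rho`; the penalty is added to the reconstruction loss during training. The function name and the example activations below are hypothetical.

```python
import math

def kl_sparsity_penalty(hidden_activations, rho=0.05):
    """KL-divergence sparsity penalty summed over hidden nodes.

    hidden_activations : list of samples, each a list of activations in (0, 1)
                         (e.g., sigmoid outputs of the hidden layer)
    rho                : target average activation of each hidden node
    """
    n_samples = len(hidden_activations)
    n_hidden = len(hidden_activations[0])
    penalty = 0.0
    for k in range(n_hidden):
        # Average activation of hidden node k over the samples.
        rho_hat = sum(sample[k] for sample in hidden_activations) / n_samples
        penalty += (rho * math.log(rho / rho_hat)
                    + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))
    return penalty

# Hypothetical activations of two hidden nodes on two samples.
sparse_case = [[0.05, 0.10], [0.05, 0.02]]  # nodes are mostly inactive
dense_case = [[0.90, 0.80], [0.70, 0.90]]   # nodes are mostly active

sparse_penalty = kl_sparsity_penalty(sparse_case)
dense_penalty = kl_sparsity_penalty(dense_case)
```

The penalty is zero when every node's average activation equals `rho` and grows as the nodes become more active, so minimizing it alongside the reconstruction error keeps only a few hidden nodes active per sample.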

Stacked auto-encoder (StAE)
A stacked auto-encoder [11,39,40] is a multi-layer auto-encoder in which each hidden layer is a representation of the previous layer obtained by an auto-encoder with one hidden layer (Figure 8). Stacked auto-encoders are often trained using the greedy layer-wise pre-training approach [11]. Given extremely high-dimensional input data, a stacked auto-encoder can extract features layer by layer and finally form a better representation to be passed into classifiers.

Denoising auto-encoder (DAE)
A basic auto-encoder can successfully retain much of the information from the inputs in the new features within the hidden layer. However, Vincent et al. [40] demonstrated that simply retaining information from the inputs does not guarantee that the extracted features are "good features", i.e., features that achieve high performance in supervised learning tasks. The denoising auto-encoder was proposed to overcome this challenge by generating a noisy version of the inputs, for example by setting the values of a small proportion of input nodes to 0 or by adding a Gaussian noise term, and then feeding the noisy inputs into the auto-encoder (Figure 9). With the introduction of noise to the original inputs, denoising auto-encoders construct more robust feature representations and can thereby generalize better to unseen examples and datasets.
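The two corruption schemes mentioned above can be sketched as follows. The function names, the masking fraction, and the noise standard deviation are hypothetical choices for illustration. In a denoising auto-encoder, the corrupted vector is fed to the encoder while the reconstruction loss is still measured against the original, uncorrupted input.

```python
import random

def mask_corrupt(x, fraction, rng):
    """Set a randomly chosen fraction of the input values to 0 (masking noise)."""
    n_mask = int(round(fraction * len(x)))
    masked = set(rng.sample(range(len(x)), n_mask))
    return [0.0 if i in masked else v for i, v in enumerate(x)]

def gaussian_corrupt(x, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in x]

rng = random.Random(0)
expression = [1.0] * 10  # hypothetical expression values for 10 transcripts

masked_input = mask_corrupt(expression, fraction=0.2, rng=rng)
noisy_input = gaussian_corrupt(expression, sigma=0.1, rng=rng)
```

A fresh corruption is usually drawn each time a sample is presented, so the network never sees exactly the same noisy input twice.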

Stacked denoising auto-encoder (SDAE) and stacked sparse auto-encoder (SSAE)
An SDAE is a multi-layer auto-encoder in which each hidden layer is a representation of the previous layer obtained by a denoising auto-encoder with one hidden layer. For example, when pre-training the two hidden layers $h_1$ and $h_2$ in Figure 8, one could add a noise term to the pre-training inputs X and $h_1$ to construct an SDAE. Vincent et al. [40] showed that the features extracted by SDAE are stable and robust under noisy inputs, achieving the best classification results on 9 out of 10 image databases. These features can efficiently capture useful information in the input distribution and have yielded equivalent or better classification performance on most of the image data processing benchmarks. Similar to SDAE, an SSAE is obtained when the number of hidden units to be activated is restricted on each hidden layer of a stacked auto-encoder. Xu et al. [41] applied SSAE to breast cancer detection using image data. The study showed that SSAE outperformed 9 other state-of-the-art cancer detection strategies and improved the F-measure to 84.49%.

Previous work on disease prediction in genomic data analysis using non-deep-learning approaches
Many methods have been proposed for disease prediction using genomic data (e.g., [42][43][44][45][46][47]). Due to the large number of predictors (i.e., gene transcripts), the main approach in disease detection/prediction is to first obtain a subset of gene transcripts (e.g., a few top gene transcripts in transcript-wise tests) or a subset of representations of gene transcripts (e.g., a few top principal components), and then to predict disease status based on the selected transcripts or representations using machine learning algorithms.
Furey et al. [42] used SVMs to classify cancer tissue samples using gene expression datasets. The study showed that SVMs are able to classify tissue and cell types based on gene expression data and have similar performances to other machine learning methods.
Khan et al. [43] were among the first to adopt basic ANNs (ANNs without hidden layers) to classify cancer samples and to identify relevant genes. In their study, the top 10 PCA components were used as inputs to the ANN to classify the small, round blue-cell tumors (SRBCT) into four distinct diagnostic categories. All 63 samples in the training set and all 25 samples in the independent testing set were correctly classified based on the 96 selected genes. Pal et al. [44] proposed combining a modified perceptron network with relational fuzzy clustering algorithms [48] to select a gene subset for cancer subgroup classification. They applied their method to the SRBCT dataset analyzed by Khan et al. [43] and identified 7 genes that can accurately classify the samples in both the training set and the testing set. Chang et al. [45] used an ANN with one hidden layer coupled with an additive step-wise approach for predicting colorectal cancer (CRC) using microRNAs (miRNAs). Three miRNAs were identified with a median accuracy of 100% using an extensive Monte Carlo cross-validation strategy. Sharma et al. [15] proposed a top-r feature selection technique that repeatedly divides and merges gene expression data to select the gene subset minimizing the loss of information.
The selected genes were then tested on three tumor datasets and achieved higher accuracies than other feature selection methods, such as transcript-wise tests. Nanni et al. [46] examined SVM classification performance using multiple feature reduction and data transformation approaches, including neighborhood preserving embedding, orthogonal wavelet coefficients, and texture descriptors. The study showed that a combination of different feature extraction methods could enhance genomic classification performance. For instance, the two combined methods achieved the highest average area under the ROC curve (AUC) (AUC=92.18% for the WF method and 92.07% for the FUS method), while the AUC values for the 8 individual feature extraction methods ranged from 79.24% to 91.85%. Jordan and Do [47] reviewed studies that predict disease using full genomic information. Their review focused on polygenic risk scores (PRS), the most common method of integrating information from across the genome into a single estimate of genetic risk. A PRS is a weighted average of the genetic status at each associated risk locus. The weight of each locus is usually the regression coefficient for that locus from a genome-wide association study (GWAS). Jordan and Do [47] noted that the power of most PRSs to predict disease risk has been very low for several reasons, such as small sample size, genetic ancestry, and heterogeneity of risk factors and causation.
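As a small worked example of the PRS idea, the sketch below computes a score as the sum of risk-allele counts weighted by per-locus effect sizes. The function name, the allele counts, and the effect sizes are hypothetical; an unnormalized weighted sum is assumed, which differs from a weighted average only by a constant factor when the weights are fixed.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of risk-allele counts across loci.

    allele_counts : number of risk alleles (0, 1, or 2) at each locus for one subject
    effect_sizes  : per-locus weights, e.g., GWAS regression coefficients
    """
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("allele_counts and effect_sizes must have the same length")
    return sum(beta * count for beta, count in zip(effect_sizes, allele_counts))

# Hypothetical subject genotyped at three risk loci.
subject_counts = [2, 1, 0]
gwas_betas = [0.2, -0.1, 0.05]
score = polygenic_risk_score(subject_counts, gwas_betas)  # 2*0.2 + 1*(-0.1) + 0*0.05
```

Because the score is linear in the allele counts, it cannot capture interactions between loci, which relates to the first limitation listed in the next paragraph.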
The main limitations of these previous works [13] include (1) ignoring potential non-linear relationships among the features; (2) ignoring the contribution of features with weak signals to distinguish diseases; and (3) over-simplifying the complex prediction problem, such as using single-layer ANNs.

Deep learning applications in disease prediction
Through a thorough literature search, we identified four papers [13,27,28,29] published between January 2013 and December 2017 that applied deep learning methods to disease prediction using genomic data (Table 1). The details of the four studies are discussed below.
Fakoor et al. [13] were among the first to apply deep learning methods to extract key features from gene microarray data for predicting cancers. They compared three auto-encoder methods: a sparse auto-encoder with one hidden layer, a stacked auto-encoder with two hidden layers, and a stacked auto-encoder with fine-tuning. They first applied PCA to eliminate the effects of redundant and noisy dimensions, then applied the three auto-encoder methods to further extract non-linearly correlated discriminating features based on the top principal components combined with some randomly selected original features, and finally used softmax regression to classify samples based on the low-dimensional representations (Table 2). Thirteen gene microarray datasets were used to compare the performance of the deep learning methods with that of two traditional prediction methods: softmax regression based on the top principal components (PCA+Softmax) and SVM with a Gaussian kernel based on the top principal components (PCA+SVM). The sample sizes of the 13 datasets range from 20 to 1,047, and the numbers of features range from 2,000 to 54,613. Ten-fold cross-validation was applied to estimate the average and standard deviation of the prediction accuracies, and the average accuracy (ACC) of each of the three deep learning methods was compared with the maximum accuracy of the two traditional methods.
The denoising auto-encoder used by Tan et al. [27] has four layers: an input layer, a corrupted input layer, a hidden layer, and a reconstructed input layer. Each node in the hidden layer was used to predict disease status (e.g., tumor vs. non-tumor, or ER+ vs. ER−) depending on whether the node value for a sample in the evaluation set is greater than the optimal threshold obtained from the discovery set and testing set. Tan et al. [27] showed that each of the top three hidden nodes in the discovery set also had high prediction accuracy (>0.9) in the evaluation set when their method was used to predict tumor status (tumor sample vs. non-tumor sample).
Danaee et al. [28] used an SDAE to transform high-dimensional, noisy RNA-seq gene expression data into lower-dimensional, meaningful representations, based on which they applied different machine learning methods to distinguish breast cancer samples from healthy control samples. They also identified a set of "Deeply Connected Genes" (DCGs) that have strongly propagated influence on the reduced-dimension SDAE encoding. Inspired by the classic study that applied SDAE to extract features from image data [40], Danaee et al. [28] built an SDAE model with four stacked layers of dimensions 15,000, 10,000, 2,000, and 500 to obtain representations of genomic features to be fed into classifiers. An RNA-seq dataset from TCGA was used to train and validate the model. The dataset contains 1,210 samples, including 1,097 breast cancer samples and 113 healthy samples. Danaee et al. [28] compared their prediction method with prediction methods based on PCA, kernel PCA (KPCA, a non-linear PCA), the 206 differentially expressed genes (DIFFEXP0.05) that were significant at an FDR of 0.05 in gene-wise tests, and the top 500 most significant differentially expressed genes (DIFFEXP500). Three classifiers, namely a single-layer ANN, SVM, and SVM-RBF (SVM with a radial basis function kernel), were used for prediction based on the extracted features. Like Tan et al. [27], Danaee et al. [28] used a training set and a testing set to train classifiers and used a validation set to evaluate the performance of the prediction methods. The classification results show that the low-dimensional representations obtained by SDAE outperformed the other four sets of extracted features. For example, SDAE+SVM-RBF had an accuracy of 98.26%, a sensitivity of 97.61%, a specificity of 99.11%, a precision of 99.17%, and an F-score [49] of 0.983. Furthermore, Danaee et al. [28] showed that DCGs had slightly lower prediction accuracy than the SDAE-extracted low-dimensional representations, but much higher prediction accuracy than the other methods.
Singh et al. [29] applied a stacked sparse auto-encoder (SSAE) to extract features to predict disease status for each of 36 datasets from the Gene Expression Machine Learning Repository (GEMLeR) [50]. The SSAE used by Singh et al. [29] has three hidden layers.

Software packages for deep-learning-based feature extraction
Since deep learning algorithms are usually complicated, it is important to have open-source software packages available so that investigators can apply them directly to their genomic data analyses. Both Tan et al. [27] and Danaee et al. [28] used the Theano software, which provides implementations of auto-encoder algorithms. Fakoor et al. [13] and Singh et al. [29] did not mention the software packages that they used for auto-encoding.

Discussion
In this article, we aimed to review all papers that applied the deep learning approach to predict disease status based on genomic data. This approach first obtains low-dimensional representations of high-dimensional genomic features and then inputs these representations to state-of-the-art classifiers that have excellent performance in low-dimensional classification problems. We found only 4 such papers, indicating that predicting disease status using deep learning on genomic data is still in its infancy. However, the results of these 4 papers showed that the deep learning approach can extract useful genomic features from high-throughput whole genome data for prediction purposes with high accuracy.
Compared with commonly used dimension-reduction methods, such as PCA and transcript-wise testing, the deep learning approach could have better performance in terms of a variety of accuracy measurements: ACC, AUC, sensitivity, specificity, precision, and F-score. Notably, transcript-wise testing, which is currently the most popular approach to identify disease-associated transcripts, performed poorly compared with PCA or auto-encoders [28]. However, whether the performance of the deep learning approach is significantly better than that of the commonly used approaches was not investigated in the 4 papers, among which only Fakoor et al. [13] provided standard errors for the estimated ACC. Moreover, Fakoor et al. [13] did not provide some key details, such as the number of principal components used and the number of randomly selected raw features. They also did not provide p-values for testing whether the mean ACC obtained using a deep learning approach is significantly better than that obtained using the PCA approach. In addition, Fakoor et al. [13] showed that not all auto-encoders outperform PCA. For example, Table 1 of Fakoor et al. [13] showed that for the first dataset, the mean ACC (standard error) was 74.36% (0.062%) using PCA+sparse auto-encoder and 51.35% (0.019%) using PCA+stacked auto-encoder, while the PCA approach had a mean ACC of 94.04% (SE 0.03%), although PCA+stacked auto-encoder with fine-tuning (95.15% (0.047%)) performed better than PCA.
Different auto-encoders were used in the 4 papers, such as sparse auto-encoder, stacked auto-encoder, stacked auto-encoder with fine-tuning, denoising auto-encoder, stacked denoising auto-encoder, and stacked sparse auto-encoder. Except for Fakoor et al. [13], none of the papers compared the auto-encoders they used with other auto-encoders. Table 1 of Fakoor et al. [13] showed that PCA+stacked auto-encoder performed worse than PCA+sparse auto-encoder and PCA+stacked auto-encoder with fine-tuning in 12 of the 13 datasets. However, neither PCA+sparse auto-encoder nor PCA+stacked auto-encoder with fine-tuning outperformed the other in all 13 datasets. For a fair comparison, it would be beneficial for future studies to compare the deep learning methods mentioned above using the same datasets.
All four papers mentioned the number of hidden layers and the number of nodes in each hidden layer used for the auto-encoders. However, no justification or guidance was given for choosing those specific numbers of hidden layers and nodes. This is probably one of the main reasons why deep learning has not been widely used in the genomic research area. There are some existing methods for choosing the number of layers and nodes, such as (1) starting from a small neural network and adding layers and nodes until the error stops decreasing, and (2) starting from a big neural network and removing layers and nodes until the error increases significantly [51]. Optimization methods such as grid search and random search have also been proposed and discussed [52] for optimizing the parameters in model training. However, these methods are still not well studied in genomic data analysis and cannot eliminate the risks of over-fitting and under-fitting. Future research is still needed on choosing and optimizing deep learning parameters, especially in genomic data analysis.
Another possible reason why deep learning has not been widely used in the genomic research area is the lack of software packages that implement deep learning algorithms for genomic data analysis. Many investigators in the genomic research area use the R language and packages in Bioconductor, a repository of R packages specifically for genomic data analysis. Although there are a couple of R packages, such as keras and kerasR, that connect R to the Keras deep learning library, there is a lack of examples and tutorials on how to use them to analyze genomic data and to visualize the low-dimensional representations obtained by auto-encoders.
It is a non-trivial task to interpret the low-dimensional representations (features) of the original expression data obtained by auto-encoders because the representations are non-linear functions of gene transcripts and the hidden layers in deep learning algorithms act like a "black box" [53]. Among the 4 papers that we reviewed, Tan et al. [27] and Danaee et al. [28] suggested interpreting the representations based on the transcripts having strongly propagated influence on the reduced-dimension auto-encoding. However, no details were given on how to select these transcripts, except that these transcripts have high edge weights.
To evaluate classification performance, several measurements were used in the four papers that we reviewed, including accuracy (ACC), area under the ROC curve (AUC), sensitivity, specificity, precision, and F-measure. We call a dataset imbalanced if the number of cases/positive samples differs greatly from the number of controls/negative samples. When the dataset is imbalanced, ACC can be misleading. For example, given a dataset with 99% true negative samples and 1% true positive samples, a classifier could achieve 99% ACC even if it wrongly classifies all the true positive samples into the negative group. Fakoor et al. [13] used only ACC as the performance metric, although several of the genomic datasets they analyzed are imbalanced. Tan et al. [27] also used only ACC to evaluate the performance of different prediction methods, although both the training and testing datasets are highly imbalanced. For imbalanced data, other performance metrics can be used, such as AUC, F-measure, and G-measure [49,54], which are less sensitive to the case/control imbalance.
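The 99%/1% example above can be reproduced with a short sketch. The function name is hypothetical, labels are coded as 1 for cases and 0 for controls, and the G-measure is taken here to be the geometric mean of sensitivity and specificity, which is an assumption since more than one definition is in use.

```python
import math

def classification_metrics(y_true, y_pred):
    """Compute common metrics from binary labels (1 = case, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Ratios with an empty denominator are reported as 0.0 in this sketch.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + sensitivity > 0:
        f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_measure = 0.0
    return {
        "acc": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        # Assumed definition: geometric mean of sensitivity and specificity.
        "g_measure": math.sqrt(sensitivity * specificity),
    }

# Imbalanced toy data: 1 case and 99 controls; the classifier calls everyone a control.
y_true = [1] + [0] * 99
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
```

Here ACC is 0.99 while sensitivity, F-measure, and G-measure are all 0, which makes the failure on the positive class visible.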
Over-fitting is a big issue in prediction. Using the same data set both to train the prediction model and to evaluate its performance usually leads to over-estimation of the prediction accuracy. Ideally, a testing set from a population independent of the training population is required to evaluate prediction accuracy. However, genomic data are usually expensive to collect. Hence, it has usually been hard to obtain an independent testing set in genomic research. Thanks to the policy of the National Institutes of Health of the United States, numerous genomic datasets are now publicly available in the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), an online repository of genomic datasets. Other public genomic repositories are also available, such as TCGA (https://cancergenome.nih.gov) and GTEx (https://www.gtexportal.org/home/). Hence, it is now relatively easy to obtain an independent testing set for most complex human diseases.
Among the four papers that we reviewed, only Tan et al. [27] used an independent testing set. The other three papers used K-fold cross-validation to alleviate the over-fitting issue.
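For readers unfamiliar with K-fold cross-validation, the following minimal sketch shows how the sample indices can be partitioned so that every sample is used for testing exactly once and is never in the training and testing sets of the same fold. The function name `k_fold_indices`, the sample size, and the random seed are illustrative assumptions; the reviewed papers do not describe their splitting code.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices 0..n_samples-1 into k (train, test) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds

    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(n_samples=10, k=5)
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={sorted(train)} test={sorted(test)}")
```

The prediction model would be trained on `train` and evaluated on `test` in each fold, and the K performance estimates would then be averaged. With imbalanced data, a stratified split that keeps the case/control ratio similar across folds may be preferable. This sketch does not stratify.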
Genomic data usually contain many sources of technical noise, such as batch effects, which arise because large sample sets have to be processed in multiple batches owing to the capacity limits of the machines. Several methods, such as ComBat [55], have been proposed to remove the effects of technical batches before downstream data analysis. We can apply ComBat to the training set and the testing set separately. Suppose that, after removing technical noise, we build and validate a prediction model based on the training set and the testing set, with excellent prediction accuracy. Now a new subject's genomic data are obtained. Can we apply the prediction model to this new subject? The answer is probably "no", since we do not know how to remove technical noise from a single new sample. One possible solution is to collect genomic data for a batch of subjects together. Then we can apply the prediction model to the subjects in this batch after removing possible batch effects. A possibly better solution is to improve the technology so as to reduce technical noise. With the advancements in sequencing technology and a rapid decline in sequencing costs, DNA sequencing has gained remarkable popularity among biomedical researchers. Compared to microarrays, DNA sequencing is believed to deliver faster, more complete, and more scientifically accurate genomic analysis [56].
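The single-sample problem described above can be illustrated with a deliberately simplified stand-in for batch correction. The sketch below is not ComBat [55]; it only subtracts each batch's mean expression for one transcript, which is an assumption made to keep the example short. The function name `center_by_batch`, the expression values, and the batch labels are all hypothetical.

```python
def center_by_batch(values, batches):
    """Subtract the batch mean from each value (simplified, not ComBat).

    values[i] is the expression of one transcript in sample i and
    batches[i] is the label of the batch in which sample i was processed.
    """
    sums, counts = {}, {}
    for value, batch in zip(values, batches):
        sums[batch] = sums.get(batch, 0.0) + value
        counts[batch] = counts.get(batch, 0) + 1
    means = {batch: sums[batch] / counts[batch] for batch in sums}
    return [value - means[batch] for value, batch in zip(values, batches)]


# Two batches of three samples; batch "B" is shifted upward by 2 units.
values = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
batches = ["A", "A", "A", "B", "B", "B"]
corrected = center_by_batch(values, batches)
print(corrected)

# A single new sample forms a batch of size one: centering removes
# everything, including any biological signal.
single = center_by_batch([4.2], ["new"])
print(single)
```

With a batch of one sample, the batch mean equals the sample itself, so the batch effect cannot be separated from the biological signal. This is consistent with the suggestion to collect new subjects in batches.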
The four deep-learning papers identified in this review compared the performance of deep learning approaches with that of the PCA approach and the transcript-wise testing approach. Many more advanced feature selection methods exist in the literature, such as the stable feature selection method [16] and the Boruta algorithm [17]. More comprehensive comparisons are warranted.
Recently, the authors of [29] improved their results using deep transfer learning [57]. Moreover, semi-supervised learning and reinforcement learning are receiving a lot of attention in image recognition, gaming, and robotics [58][59][60]. How to apply these frontier deep learning innovations to genomic data analysis could be an interesting future research topic [61].
Finally, we would like to mention a few related review articles on deep learning. All of these reviews are quite broad and do not focus on the prediction of disease status using genomic data. Ching et al. [61] examined applications of deep learning to a variety of biomedical problems, including patient classification. They only briefly mentioned disease prediction based on auto-encoders and cited Tan et al. [27]. Miotto et al. [62] reviewed the application of deep learning in the healthcare domain and cited Fakoor et al. [13]. Mamoshina et al. [63] reviewed the application of deep learning in biomedicine. Angermueller et al. [64] reviewed the application of deep learning in regulatory genomics and cellular imaging. To the best of our knowledge, our review is the first to focus on the prediction of disease status based on deep learning, which is an important component of personalized medicine.

Conclusion
In summary, this review showed that applying deep learning to find a low-dimensional representation of high-throughput genomic data is a promising trend in disease prediction based on high-dimensional genomic data. The low-dimensional representation obtained by deep learning could capture both linear and non-linear relationships among the transcripts. Deep learning is a new technology for most scientists in genetics. Scientists in genetics should collaborate to understand how deep learning could help predict disease status using genomic data, and hence move this field forward.

Figure 1:
An illustration of gene expression data. In the figure, each row represents one gene transcript and each column represents one sample (one person). The (i,j) cell records the expression level of the i-th gene transcript for the j-th sample. Gene expression data typically have high dimensionality (20,000-50,000 gene transcripts) and small sample size (<1000), resulting in the "curse of dimensionality" problem.

Figure 2:
An illustration of building prediction models using genomic datasets. The idea is to first reduce the dimensionality of the input features and then feed the low-dimensional features into prediction models/classifiers. Dimensionality reduction techniques typically include transcript-wise testing, principal component analysis (PCA), and auto-encoders.

Figure 3:
An illustration of a simple ANN. This simple feed-forward ANN has four input nodes and one output node. On the edges, w1-w4 represent the weights of the input nodes. The value Y1 for the output node is computed as

Figure 4:
An illustration of a multiple-layer ANN. This multiple-layer ANN has one input layer, two hidden layers, and one output layer, with each layer connected to the previous layer. The activation function f is applied to each node on the hidden layer and the output layer.

Figure 5:
Gradient Descent Training. The x-axis is the weight w and the y-axis is the loss function Loss(w). In gradient descent optimization, the learning rate represents how much the edge weights are adjusted in each step as the algorithm moves toward the minimum of the loss. The learning rate can also be seen as the "step size" in the learning process. With a higher learning rate, the weights are adjusted by a greater amount in each step. With a lower learning rate, the weights are adjusted by a smaller amount in each step.

Figure 6:
Illustration of a basic auto-encoder. This auto-encoder has 2 hidden units. X is the inputs, Y is the reconstruction of X in the output layer, and h is the hidden layer. The dimension of the original input data is reduced from p=4 to q=2. The optimal representation in the q-dimensional space is obtained by minimizing the difference between the inputs X and the reconstructed inputs Y.

Figure 7:
Illustration of a sparse auto-encoder. A sparse auto-encoder restricts the number of hidden nodes activated by adding a sparsity term to the loss function. The sparsity term sets the expected activation value of the hidden nodes to a small constant so that most of the hidden nodes' activations are near zero. Hence, very few hidden nodes are activated in a sparse auto-encoder.
wired on top of the hidden layers to make predictions. The whole architecture could be fine-tuned together using backpropagation and labeled data, which is shown in (c).

A summary of different auto-encoders.

Method | Description
Regular auto-encoder (AE) | Find low-dimensional representation of input using an unsupervised approach (i.e., no outcome information is used)
Sparse AE (SpAE) | Restrict the number of hidden nodes to be activated to avoid too many free nodes with complex dependencies in each layer
Stacked AE (StAE) | Each hidden layer is a low-dimensional representation of the previous layer obtained by AE
Denoising AE (DAE) | Introduce noise to the input to make the AE more robust to noise
Stacked denoising AE (SDAE) | Combine stacked AE and DAE (i.e., introduce noise to the input in a stacked AE)
Stacked sparse AE (SSAE) | Combine stacked AE and SpAE (i.e., introduce a sparsity restriction on the stacked AE hidden layers)